/********************************************************************************
Discrimination in Multi-Phase Systems: Evidence from Child Protection

Created on: 12/28/2022
Last Modified on: 2/18/2024

Description: This program generates a child-by-investigation sample covering 
2008 to 2019. It starts from the dataset of investigations from 2008 to 2017 
built in Gross and Baron (2022). We match this dataset to a list of primary CPS 
workers in the state's HR records, which contain demographic information. We then 
extend the sample to 2019 by appending a more recent dataset covering 2017 to 2019.

Note that we have removed the file directory names from this program for 
confidentiality reasons. 
********************************************************************************/

**************************
**(0) SETUP
**************************
clear
set more off
macro drop _all
capture log close

*Set directories
global clean 
global cleandata 
global tmpdata 
global rawdata 
global output 

**************************
**(1) MATCH THE GROSS AND BARON (2022) DATASET TO WORKERS FROM THE STATE'S HR LIST FROM 2008 TO 2017
**************************
*This dataset contains all investigations from 2008 to 2017 in Gross and Baron's sample
use "${clean}master_child_panel_withpost.dta", clear 
*Keep investigations handled by primary CPS workers in the state's HR list, for whom we have names and can therefore impute demographic info 
merge m:1 worker_id using "${tmpdata}inv_insample.dta", keep(3) keepusing(worker_id) 
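*Optional check (a sketch, not part of the original program): because keep(3) retains only matched 
*observations, _merge should equal 3 in every remaining row. capture noisily reports a failure without stopping the run.
capture noisily assert _merge==3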
save "${clean}child_investigation_sample.dta", replace 

**************************
**(2) GENERATE ANALYSIS SAMPLE BY APPENDING 2008-2017 AND 2017-2019 DATA 
**************************
use "${clean}child_investigation_sample.dta", clear 

*Drop duplicates by victim and investigation - should be a victim X inv sample 
gegen tag=tag(vicid inv_caseid)
keep if tag==1 
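*Optional check (a sketch, not part of the original program): after keeping one row per victim and 
*investigation, vicid and inv_caseid should jointly identify observations. 
capture noisily isid vicid inv_caseid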

*This sample runs from January 1 2008 to October 23 2017. The newer data run from April 24 2017 to June 30 2019, so drop the overlapping calls (April 24 2017 to October 23 2017) from this sample before appending
keep if complaint_date<td(24apr2017) //td(24apr2017) equals 20933
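*Optional check (a sketch, not part of the original program): 20933 is Stata's daily-date value for 
*April 24 2017, the first day covered by the newer data, so the cutoff above keeps complaints dated before that day.
display %td 20933
assert td(24apr2017)==20933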
append using "${tmpdata}sample_inv_2017_2019.dta"
format cw_date_stata %td
replace complaint_date = cw_date_stata if complaint_date==.
drop cw_date_stata 

**SAMPLE RESTRICTIONS
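*Keep only non-repeat investigations: drop any investigation whose complaint date is fewer than 365 days after the same child's previous one 
sort vicid complaint_date inv_caseid, stable  
cap drop diff 
*Order within child explicitly so that diff does not depend on the sort order left in memory
bysort vicid (complaint_date inv_caseid): gen diff = complaint_date - complaint_date[_n-1]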
order diff 

drop if diff<365 & diff!=.
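
*Optional check (a sketch, not part of the original program): once the rows above are dropped, consecutive 
*investigations of the same child should be at least 365 days apart. The first row per child has a missing gap, 
*which Stata treats as larger than any number, so it passes the check.
tempvar gap
bysort vicid (complaint_date inv_caseid): gen `gap' = complaint_date - complaint_date[_n-1]
capture noisily assert `gap'>=365
drop `gap'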

*Drop sexual abuse cases 
drop if sexab==1 

*Drop observations with missing zipcodes 
drop if zipcode_vic==. 

*Keep only white and black children
keep if white==1 | black==1 

*Limit to investigators with at least 200 cases
bysort worker_id: gen n=_N
drop if n<200

*Generate rotation and drop "trivial rotations"
egen rotationgroup = group(zipcode_vic cps_year)
bysort rotationgroup: gen nobs = _N 
tab nobs if nobs<10 
drop if nobs ==1 

*Drop investigators who were assigned *only* white or *only* black children (no within-investigator variation in child race)
gegen tmp = var(white), by(worker_id)
order tmp 
drop if tmp==0 
drop tmp 

gegen tmp = var(black), by(worker_id)
order tmp 
drop if tmp==0 

*Drop observations that we can't follow for at least six months (applies only to complaints before April 24 2017, i.e. the 2008-2017 sample) 
drop if missing(postm1_inv, postm2_inv, postm3_inv, postm4_inv, postm5_inv, postm6_inv) & complaint_date<td(24apr2017)
rename black d_black

*Generate main outcomes: inv_Jm equals 1 if any of postm1_inv through postmJ_inv equals 1, for J = 1 to 6 
forvalues j = 1/6 {
	gen inv_`j'm = 0 
	forvalues i = 1/`j' {
		sum postm`i'_inv
		replace inv_`j'm = 1 if postm`i'_inv==1
		sum inv_`j'm* 
	}
}
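
*Optional check (a sketch, not part of the original program): each indicator is cumulative by construction, 
*so inv_1m <= inv_2m <= ... <= inv_6m should hold in every row.
forvalues j = 2/6 {
	local k = `j' - 1
	capture noisily assert inv_`k'm <= inv_`j'm
}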

replace inv6m = inv_6m if inv6m==.

foreach x in inv6m {
	replace `x'=. if fc==1
}

sum inv*m

*Generate remaining variables
gen nofc=fc==0

//Count of cases
cap drop count_inv
bys worker_id: gen long count_inv = _N

// Count of cases by investigator by race:
*rename pre_black d_black
bys worker_id: egen count_black = total(d_black)
gen nonblack = (d_black==0)
bys worker_id: egen count_white = total(nonblack)
bys worker_id: egen share_black=mean(d_black)
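
*Optional check (a sketch, not part of the original program): assuming d_black is never missing, the two 
*race counts should add up to each investigator's caseload. A failure here would point to missing values in d_black.
capture noisily assert count_black + count_white == count_inv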

sum d_black 
local bshare = r(mean)
gen bshare = `bshare'
cap drop pre_black 
rename d_black pre_black 

save "${cleandata}analysis_sample_investigators_qje.dta", replace 